Nature Computational Science

Springer Science and Business Media LLC

All preprints, ranked by how well they match Nature Computational Science's content profile, based on 50 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Scalable deep-learning-based inference of time-varying transmission dynamics from outbreak phylogenies

Xie, R.; Zhukova, A.; Pena, P. G.; Iglesias, G.; Hu, S.; Wang, J.; Tsang, T. K.; Dhanasekaran, V.; Kraemer, M. U. G.; Pybus, O. G.; Gascuel, O.

2026-05-10 infectious diseases 10.64898/2026.05.07.26352673 medRxiv
Top 0.1%
22.5%

Infectious disease dynamics can be inferred from pathogen genomic data using phylodynamic methods, but the applicability of many such approaches to large data sets is constrained by computational cost. Recent deep-learning approaches to phylodynamics have improved scalability, yet challenges remain when genetic divergence is limited during fast spreading outbreaks. To address this, we use pathogen-specific models to show that deep-learning models trained on outbreak-like phylogenies can accurately estimate the reproductive number (R) when both the birth-death model and the expected phylogenetic resolution are matched to the target pathogen, highlighting the importance of realistic training conditions. Focusing on three major respiratory pathogens of public health importance (SARS-CoV-2, seasonal human influenza virus, and respiratory syncytial virus (RSV)), we introduce PhyloRt, a scalable framework for estimating the time-varying reproductive number (Rt) from large outbreak phylogenies. PhyloRt decomposes large trees into overlapping subtrees and applies a hierarchical deep-learning-based inference strategy to classify subtrees as exhibiting constant or time-varying reproduction numbers, enabling identifiable and computationally efficient estimation of Rt as a piecewise-constant trajectory through time. Applications to SARS-CoV-2 and influenza outbreaks show that PhyloRt recovers transmission dynamics consistent with estimates derived from mathematical epidemiological and Bayesian phylodynamic analyses. Our work enables scalable and rapid estimation of time-varying transmission dynamics from very large-scale outbreak genomic data sets, supporting real-time genomic epidemiology of emerging pathogens.

Significance: Estimating changes in transmission dynamics over time is important for responding to infectious disease outbreaks. Current methods mostly rely on reported case data from epidemiological surveillance, which can be biased or incomplete due to variable testing capabilities, particularly in resource-limited settings. A complementary approach is to use viral genomes as an alternative data source. However, inferences from genomic data can be computationally intensive and have mainly been applied retrospectively. We present PhyloRt, a scalable deep-learning-based phylodynamic framework that enables fast inference of the time-varying reproductive number (Rt) from large outbreak phylogenies. Our approach is widely applicable and provides a practical approach to monitoring epidemic dynamics, complementing traditional surveillance and supporting timely public health decision-making.

2
Preparing For the Next Pandemic: Learning Wild Mutational Patterns At Scale For Analyzing Sequence Divergence In Novel Pathogens

Li, J.; Li, T.; Chattopadhyay, I.

2020-07-19 infectious diseases 10.1101/2020.07.17.20156364 medRxiv
Top 0.1%
22.1%

As we begin to recover from the COVID-19 pandemic, a key question is whether we can avert such disasters in the future. Current surveillance protocols generally focus on qualitative impact assessments of viral diversity [1]. These efforts are primarily aimed at ecosystem and human impact monitoring, and do not help to precisely quantify emergence. Currently, the similarity of biological strains is measured by the edit distance or the number of mutations that separate their genomic sequences [2-6], e.g. the number of mutations that make an avian flu strain human-adapted. However, ignoring the odds of those mutations in the wild keeps us blind to the true jump risk, and gives us little indication of which strains are more risky. In this study, we develop a more meaningful metric for comparison of genomic sequences. Our metric, the q-distance, precisely quantifies the probability of spontaneous jump by random chance. Learning from patterns of mutations from large sequence databases, the q-distance adapts to the specific organism, the background population, and realistic selection pressures, demonstrably improving inference of ancestral relationships and future trajectories. As an important application, we show that the q-distance predicts future strains for seasonal Influenza, outperforming World Health Organization (WHO) recommended flu-shot composition almost consistently over two decades. Such performance is demonstrated separately for the Northern and Southern hemispheres for different subtypes, and key capsidic proteins. Additionally, we investigate the SARS-CoV-2 origin problem, and precisely quantify the likelihood of different animal species that hosted an immediate progenitor, producing a list of related species of bats that have a quantifiably high likelihood of being the source. Additionally, we identify specific rodents with a credible likelihood of hosting a SARS-CoV-2 ancestor. Combining machine learning and large deviation theory, the analysis reported here may open the door to actionable predictions of future pandemics.

3
Reconstructing whole-brain structure and dynamics using imaging data and personalized modeling

Fabbrizzi, M.; Amato, L. G.; Martinelli, L.; Carpaneto, J.; Bartolini, E.; Calderoni, S.; Retico, A.; Vergani, A. A.; Mazzoni, A.

2025-01-10 neurology 10.1101/2025.01.06.24319726 medRxiv
Top 0.1%
18.6%

Brain structure plays a pivotal role in shaping neural dynamics. Current models lack the anatomical and functional resolution needed to accurately capture both structural and dynamical features of the human brain. Here, we introduce the FEDE (high FidElity Digital brain modEl) pipeline, generating anatomically accurate brain digital twins from imaging data. Using advanced techniques of anatomical tissue segmentation and finite-element analysis, FEDE reconstructs brain structure with high spatial resolution, while also replicating whole-brain neural activity. We demonstrated its application by creating the first brain digital twin of a toddler with autism spectrum disorder (ASD). Through parameter optimization, FEDE replicated both time-frequency and spatial features of recorded neural activity. Notably, FEDE predicted patient-specific aberrant values of excitation to inhibition ratio, consistent with ASD pathophysiology. FEDE represents a significant leap forward in brain modeling, paving the way for more effective applications of digital twins in experimental and clinical settings.

4
Personalized Feature Statistics: Individual-Level Variant Inference under Genetic Ancestry Continuum

Wang, J. F.; Yu, R.; Edelson, J.; Park, J.; Le Guen, Y.; Liu, X.; Belloy, M.; Ionita-Laza, I.; Greicius, M.; Tang, H.; He, Z.

2026-04-29 neurology 10.64898/2026.04.28.26351879 medRxiv
Top 0.1%
18.2%

Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with complex diseases. However, the extent to which the effects of these variants vary across populations of diverse ancestries remains poorly understood. Furthermore, in these contexts genetic ancestry is treated as a categorical variable, thereby oversimplifying its continuous nature and the more nuanced ways in which it can influence genetic effects on disease. Here, we propose personalized feature statistics (PFstatistics), a statistical framework that quantifies the importance of genetic variants to a phenotype based on each individual's ancestry background, and profiles heterogeneous genetic effects across the genetic ancestry continuum. We demonstrate the utility of this framework through both simulations and real data analysis using sequencing data from ancestrally diverse cohorts in the Alzheimer's Disease Sequencing Project (ADSP). We show that Alzheimer's Disease (AD) risk variants span a spectrum from ancestry-homogeneous to ancestry-dependent effects, and that PFstatistics characterizes this spectrum at individual resolution across the ancestry continuum. PFstatistics also provides individual-level variant selection with FDR controlled at a target level, yielding distinct selection sets that vary across individuals according to their ancestry background. While demonstrated in the context of genetic ancestry, the proposed method is broadly applicable to other heterogeneity features such as environmental factors, offering a robust tool for understanding complex genetic contributions across diverse populations.

5
Virtual Epileptic Patient (VEP): Data-driven probabilistic personalized brain modeling in drug-resistant epilepsy

Wang, H. E.; Woodman, M.; Triebkorn, P.; Lemarechal, J.-D.; Jha, J.; Dollomaja, B.; Vattikonda, A. N.; Sip, V.; Medina Villalon, S.; Hashemi, M.; Guye, M.; Scholly, J.; Bartolomei, F.; Jirsa, V.

2022-01-21 neurology 10.1101/2022.01.19.22269404 medRxiv
Top 0.1%
17.3%

One-third of 50 million epilepsy patients worldwide suffer from drug-resistant epilepsy and are candidates for surgery. Precise estimates of the epileptogenic zone networks (EZNs) are crucial for planning intervention strategies. Here, we present the Virtual Epileptic Patient (VEP), a multimodal probabilistic modeling framework for personalized end-to-end analysis of brain imaging data of drug-resistant epilepsy patients. The VEP uses data-driven, personalized virtual brain models derived from patient-specific anatomical (such as T1-MRI, DW-MRI, and CT scan) and functional data (such as stereo-EEG). It employs Markov Chain Monte Carlo (MCMC) and optimization methods from Bayesian inference to estimate a patient's EZN while considering robustness, convergence, sensor sensitivity, and identifiability diagnostics. We describe both high-resolution neural field simulations and a low-resolution neural mass model inversion. The VEP workflow was evaluated retrospectively with 53 epilepsy patients and is now being used in an ongoing clinical trial (EPINOV).

6
Unsupervised seizure annotation and detection with neural dynamic divergence

Ojemann, W. K. S.; Xu, Z.; Shi, H.; Walsh, K.; Pattnaik, A. R.; Sinha, N.; Lavelle, S.; Aguila, C.; Gallagher, R.; Revell, A. Y.; LaRocque, J. J.; Korzun, J.; Kulick-Soper, C. V.; Zhou, D. J.; Galer, P. D.; Sinha, S. R.; Shinohara, R.; Davis, K. A.; Litt, B.; Conrad, E. C.

2026-02-17 neurology 10.64898/2026.02.15.26346325 medRxiv
Top 0.1%
17.3%

Annotating seizure onset and spread in intracranial EEG is essential for epilepsy surgical planning, yet manual annotation is unreliable and cannot scale to large datasets. We introduce Neural Dynamic Divergence (NDD), an unsupervised framework that detects seizure activity by measuring deviation from patient-specific baseline neural dynamics using autoregressive models. NDD requires no labeled training data and adapts to individual patients, channels, and brain states. Validating against expert consensus annotations from 46 seizures, NDD achieves human-level agreement (φ = 0.58 vs. inter-rater φ = 0.64) and outperforms existing algorithms on 1,019 seizures with soft labels (AUROC = 0.87). We demonstrate clinical utility by automatically annotating 2,017 seizures, revealing that seizure spread patterns distinguish epilepsy subtypes and predict surgical outcomes. NDD also generalizes to continuous ICU scalp EEG monitoring (AUROC = 0.77). We provide NDD as an open-source Python package to enable scalable seizure annotation across research centers.

7
Spectral normative modeling of brain structure

Mansour L, S.; Di Biase, M. A.; Yan, H.; Xue, A.; Venketasubramanian, N.; Chong, E.; Alexander-Bloch, A.; Chen, C.; Zhou, J. H.; Yeo, B. T. T.; Zalesky, A.

2025-01-21 radiology and imaging 10.1101/2025.01.16.25320639 medRxiv
Top 0.1%
14.5%

Normative modeling in neuroscience aims to characterize interindividual variation in brain phenotypes and thus establish reference ranges, or brain charts, against which individual brains can be compared. Normative models are typically limited to coarse spatial scales due to computational constraints, limiting their spatial specificity. They additionally depend on fixed regions from fixed parcellation atlases, restricting their adaptability to alternative parcellation schemes. To overcome these key limitations, we propose spectral normative modeling (SNM), which leverages brain eigenmodes for efficient spatial reconstruction to generate normative ranges for arbitrary new regions of interest. Benchmarking against conventional counterparts, SNM achieves a 98.3% speedup in computing accurate normative ranges across spatial scales, from millimeters to the whole brain. We demonstrate its utility by elucidating high-resolution individual cortical atrophy patterns and characterizing the heterogeneous nature of neurodegeneration in Alzheimer's disease. SNM lays the groundwork for a new generation of spatially precise brain charts, offering substantial potential to drive advances in individualized precision medicine.

8
PANDORA: Population Archive of Neuroimaging Data Organized for Rapid Analysis

Abivardi, A.; Webster, M.; McCarthy, P.; Alfaro-Magro, F.; Radosavljevic, L.; Miller, K. L.; Jbabdi, S.; Woolrich, M. W.; Gong, W.; Beckmann, C. F.; Elliott, L. T.; Nichols, T. E.; Smith, S. M.

2026-01-06 radiology and imaging 10.64898/2026.01.05.26343425 medRxiv
Top 0.1%
14.1%

Population-scale neuroimaging allows for novel biological discovery, but voxelwise analyses are computationally paralyzing and noisy, whereas imaging-derived phenotypes discard crucial spatial detail. We introduce PANDORA (Population Archive of Neuroimaging Data Organized for Rapid Analysis), a data-adaptive modelling platform designed to resolve this trade-off. PANDORA has encoded brain MRI data comprising 98 sub-modalities from over 80,000 UK Biobank participants in a highly efficient supervoxel representation. By performing statistical regressions directly within this compressed embedding, PANDORA reduces storage by up to 99% and accelerates computation 10-fold, while acting as a spatial denoiser to enhance statistical power. PANDORA also includes the full-resolution voxelwise ground-truth data, curated imaging confound variables, and a fast analysis tool achieving whole brain, voxelwise population-level regression in seconds to minutes. We showcase PANDORA's ability to reproduce known patterns and reveal new associations including trauma, anxiety/depression, autism polygenic scores, and EPHA3.

9
GenBrain: A Generative Foundation Model of Multimodal Brain Imaging

Yang, C.; Feng, J.; Beckmann, C. F.; Smith, S. M.; Gong, W.

2025-12-20 radiology and imaging 10.64898/2025.12.19.25342614 medRxiv
Top 0.1%
14.0%

Neuroimaging faces a reproducibility crisis, where studies on small, heterogeneous datasets produce unreliable brain-wide associations and AI models that fail to generalize. To address this, we introduce GenBrain, a generative foundation model pretrained on approximately 1.2 million 3D scans from over 44,000 individuals across 34 imaging modalities to learn a population prior of brain structure and function. Crucially, GenBrain enables rapid, data-efficient adaptation, allowing any targeted study to generate biologically valid synthetic cohorts, conditioned on demographics, disease status, or other modalities, to augment statistical power and enhance generalizability. We demonstrate GenBrain's transformative utility across 81 independent datasets spanning diverse populations, protocols, and clinical conditions. For image-level tasks, it achieves state-of-the-art performance in image enhancement and cross-modality synthesis while preserving subject-specific neurobiology. In population neuroscience, synthetic cohorts from GenBrain stabilize effect-size estimates and significantly improve the reproducibility of brain-wide association studies. For clinical AI, disease-specific fine-tuning of GenBrain substantially boosts the cross-site generalizability of prediction models. Finally, we prove its direct translational value when adapted to unseen modality and scarce clinical stroke data. GenBrain significantly improves predictions of acute stroke severity and chronic aphasia, demonstrating actionable utility under extreme data scarcity. By empowering small-scale studies with large-scale population priors, GenBrain provides a unified framework for more reproducible and clinically generalizable neuroimaging analysis.

10
MDZip: Neural Compression of Molecular Dynamics Trajectories for Scalable Storage and Ensemble Reconstruction

De Silva, N.; Perez, A.

2025-08-01 biophysics 10.1101/2025.07.31.667955 medRxiv
Top 0.1%
12.6%

The size of molecular dynamics (MD) trajectories remains a major obstacle for data sharing, long-term storage, and ensemble analysis at scale. Existing solutions often rely on frame subsampling or reduced atom representations, which limit the utility of shared datasets. Here, we present MDZip, a neural compression framework based on convolutional autoencoders trained per system to reconstruct atomic trajectories with high geometric fidelity from compact latent representations. MDZip achieves over 95% reduction in storage size across a diverse benchmark of proteins, protein-peptide complexes, and nucleic acids. Despite operating in a physics-agnostic manner, the reconstructed trajectories accurately preserve ensemble-level features, including RMSD fluctuations, pairwise distance distributions, radius of gyration, and projections onto principal and time-lagged independent components. A residual (skip-connected) autoencoder variant consistently improves reconstruction accuracy and reduces outliers. While local structural deviations can impair energetic fidelity, short energy minimization partially recovers physically reasonable conformations. This framework enables customizable compression-accuracy trade-offs and supports a modular workflow for sharing latent representations, decoder models, and reconstruction protocols. MDZip offers a scalable solution to current storage limitations, facilitating broader dissemination of MD data without sacrificing essential dynamical information.

11
Characterization of long-term patient-reported symptoms of COVID-19: an analysis of social media data

Banda, J. M.; Adderley, N.; Ahmed, W.-U.-R.; AlGhoul, H.; Alser, O.; Alser, M.; Areia, C.; Cogenur, M.; Fister, K.; Gombar, S.; Huser, V.; Jonnagaddala, J.; Lai, L.; Leis, A.; Mateu, L.; Mayer, M. A.; Minty, E.; Morales, D. R.; Natarajan, K.; Paredes, R.; Periyakoil, V. S.; Prats-Uribe, A.; Ross, E. G.; Singh, G. V.; Subbian, V.; Vivekanantham, A.; Prieto-Alhambra, D.

2021-07-15 infectious diseases 10.1101/2021.07.13.21260449 medRxiv
Top 0.1%
12.5%

As the SARS-CoV-2 virus (COVID-19) continues to affect people across the globe, there is limited understanding of the long-term implications for infected patients [1-3]. While some of these patients have documented follow-ups on clinical records, or participate in longitudinal surveys, these datasets are usually designed by clinicians, and not granular enough to understand the natural history or patient experiences of long COVID. In order to get a complete picture, there is a need to use patient-generated data to track the long-term impact of COVID-19 on recovered patients in real time. There is a growing need to meticulously characterize these patients' experiences, from infection to months post-infection, and with highly granular patient-generated data rather than clinician narratives. In this work, we present a longitudinal characterization of post-COVID-19 symptoms using social media data from Twitter. Using a combination of machine learning, natural language processing techniques, and clinician reviews, we mined 296,154 tweets to characterize the post-acute infection course of the disease, creating detailed timelines of symptoms and conditions, and analyzing their symptomatology during a period of over 150 days.

12
Interdisciplinary modelling and forecasting of dengue

Mills, C.; Kraemer, M. U. G.; Donnelly, C. A.

2024-10-18 infectious diseases 10.1101/2024.10.18.24315690 medRxiv
Top 0.1%
12.3%

Understanding the past, current, and future dynamics of dengue epidemics is challenging yet increasingly important for global public health. Using data from northern Peru across 2010-2021, we introduce a multi-model approach that integrates new and existing techniques for understanding and predicting dengue epidemics. Using wavelet analyses, we unveil spatiotemporal patterns and estimate space-varying epidemic drivers across shorter and longer dengue cycles, while our Bayesian hierarchical model allows us to quantify the timing, structure, and intensity of such climatic influences. For forecasting, as a single model is generally sub-optimal, we introduce trained and untrained probabilistic ensembles. In settings that mirror real-world implementations, we develop climate-informed and covariate-free deep learning forecasting models involving foundational time series, temporal convolutional networks, and conformal inference. We complement modern techniques with statistically principled training, assessment, and benchmarking of ensembles, alongside interpretable metrics for outbreak detection to disseminate outputs with communities and public health authorities. Our ensembles generally outperformed individual models across space and time. Looking forward, whether the public health objective is to learn from the past and/or to predict future dengue epidemic dynamics, our multi-model approach can be used to inform the decision-making of public health authorities.

13
Robust uncertainty quantification in popular estimators of the instantaneous reproduction number

Steyn, N.; Parag, K. V.

2024-10-22 infectious diseases 10.1101/2024.10.22.24315918 medRxiv
Top 0.1%
12.2%

The instantaneous reproduction number (Rt) is a key measure of the rate of spread of an infectious disease. Correctly quantifying uncertainty in Rt estimates is crucial for making well-informed decisions. Popular Rt estimators leverage smoothing techniques to distinguish signal from noise. Examples include EpiEstim and EpiFilter, which are both controlled by a "smoothing parameter" that is traditionally selected by users. We demonstrate that the values of these smoothing parameters are unknown, vary markedly with epidemic dynamics, and show that data-driven smoothing is crucial for accurate uncertainty quantification of Rt estimates. We derive model likelihoods for the smoothing parameters in both EpiEstim and EpiFilter and develop a Bayesian framework to automatically marginalise these parameters when fitting to epidemiological time-series data. This yields novel marginal posterior predictive distributions which prove integral to rigorous model evaluation. Applying our methods, we find that default parameterisations of these widely-used estimators can negatively impact Rt inference, delaying detection of epidemic growth, and misrepresenting uncertainty (typically producing overconfident estimates), with implications for public health decision-making. Our extensions mitigate these issues, provide a principled approach to uncertainty quantification, improve the robustness of real-time Rt inference, and facilitate model comparison using observable quantities.

14
Probabilistic Mapping and Automated Segmentation of Human Brainstem White Matter Bundles

Olchanyi, M.; Schreier, D. R.; Li, J.; Maffei, C.; Sorby-Adams, A.; Kinney, H. C.; Healy, B. C.; Freeman, H. J.; Shless, J.; Destrieux, C.; Tregidgo, H.; Iglesias, J. E.; Brown, E. N.; Edlow, B. L.

2025-05-05 neurology 10.1101/2025.05.01.25326687 medRxiv
Top 0.1%
12.2%

Brainstem white matter bundles are essential conduits for neural signaling involved in modulation of vital functions ranging from homeostasis to human consciousness. Their architecture forms the anatomic basis for brainstem connectomics, subcortical mesoscale circuit models, and deep brain navigation tools. However, their small size and complex morphology compared to cerebral white matter structures make mapping and segmentation challenging in neuroimaging. This results in a near absence of automated brainstem white matter tracing methods. We leverage diffusion MRI tractography to create BrainStem Bundle Tool (BSBT), which segments eight key white matter bundles in the rostral brainstem. BSBT performs automated segmentation on a custom probabilistic fiber map generated from tractography with a convolutional neural network architecture tailored for detection of small structures. We demonstrate BSBT's robustness across diffusion MRI acquisition protocols through validation on healthy subject in vivo scans and ex vivo scans of brain specimens with corresponding histology. Using BSBT, we reveal distinct brainstem white matter bundle alterations in Alzheimer's disease, Parkinson's disease, and acute traumatic brain injury cohorts through tract-based analysis and classification tasks. Finally, we provide proof-of-principle evidence supporting the prognostic utility of BSBT in a longitudinal analysis of coma recovery. BSBT creates opportunities to automatically map brainstem white matter in large imaging cohorts and investigate its role in a broad spectrum of neurological disorders.

15
FrameDiPT: SE(3) Diffusion Model for Protein Structure Inpainting

Zhang, C.; Leach, A.; Makkink, T.; Arbesu, M.; Kadri, I.; Luo, D.; Mizrahi, L.; Krichen, S.; Lang, M.; Tovchigrechko, A.; Lopez Carranza, N.; Sahin, U.; Beguir, K.; Rooney, M.; Fu, Y.

2024-01-20 immunology 10.1101/2023.11.21.568057 medRxiv
Top 0.1%
11.0%

The field of protein structure prediction has been revolutionised by deep learning with protein folding models such as AlphaFold 2 and ESMFold. These models enable rapid in silico prediction and have been integrated into de novo protein design and protein-protein interaction (PPI) prediction. However, biologically relevant features dependent on conformational distributions cannot be estimated with these models. Diffusion models, a novel class of generative models, have been developed to learn conformational distributions and applied to de novo protein design. Limited work has been done on protein structure inpainting, where a masked section is recovered by simultaneously conditioning on its sequence and the rest of the structure. In this work, we propose FrameDiff inPainTing (FrameDiPT), a generalised model for protein inpainting. This is important for T-cells given the hyper-variability of the complementarity determining region (CDR) loops. We evaluated the model on CDR loop design for T-cell receptors and achieved comparable prediction accuracy to ProteinGenerator and RFdiffusion with limited training data and learnable parameters. Different from deterministic structure prediction models, FrameDiPT captures the conformational distribution at different regions and binding states, highlighting a key advantage of generative models. The model and inference code have been released [1].

16
ci-fGBD: Cluster-Integrated Fast Generalized Bruhat Decomposition for Multimodal Data Clustering in Alzheimer's Disease.

Thakur, L. S.; Bharj, G.; Sangabattula, L.; Malik, B.

2025-09-05 neurology 10.1101/2025.09.03.25334761 medRxiv
Top 0.1%
10.2%

Multimodal biomedical datasets, such as those from neurodegenerative disease cohorts, present significant challenges in stratifying heterogeneous patient populations due to missing values, high dimensionality, and modality-specific biases. Traditional clustering methods often require extensive preprocessing and fail to integrate heterogeneous data types effectively. We introduce ci-fGBD (Cluster-Integrated Fast Generalized Bruhat Decomposition), a novel matrix factorization and clustering framework that natively operates on block-structured, multimodal datasets. ci-fGBD extends the classical Bruhat decomposition by jointly learning latent representations and patient clusters while automatically harmonizing contributions across diverse modalities, including neuroimaging, cognitive assessments, genomics, wearable sensors, and environmental exposures. Benchmarking against standard methods on real datasets demonstrates that ci-fGBD consistently identifies clinically meaningful subgroups, capturing subtle biological, cognitive, and demographic heterogeneity in Alzheimer's disease cohorts with superior interpretability and robustness.

17
Privacy-Preserving Multivariate Bayesian Regression Models for Overcoming Data Sharing Barriers in Health and Genomics

Sorensen, I. F.; Sorensen, P.

2025-07-30 health informatics 10.1101/2025.07.30.25332448 medRxiv
Top 0.1%
10.2%

We present multivariate Bayesian regression models specifically designed to overcome data-sharing barriers in health and genomics. These multi-response models are well suited for scenarios where data must remain decentralized due to privacy, intellectual property, or regulatory constraints. In extensive simulation studies, our approach consistently outperformed traditional single-response models trained on individual datasets, particularly under real-world conditions such as low signal, unbalanced cohorts, and high-dimensional feature spaces. For the first time, we demonstrate that multivariate Bayesian regression can be implemented using orthogonal transformations of sufficient statistics, enabling fully privacy-preserving analysis without sharing individual-level data. The models are scalable, interpretable, and applicable to predictive tasks across diverse collaborators, supporting secure data-driven research in domains such as clinical trials, biomarker discovery, and precision health.

18
Reward-Guided Generation Improves the Scientific Utility of Synthetic Biomedical Data

Jackson, N. J.; Espinosa-Dice, N.; Yan, C.; Malin, B. A.

2026-03-16 health informatics 10.64898/2026.03.11.26348077 medRxiv
Top 0.1%
10.1%

Synthetic data generation is a promising approach for biomedical data sharing and dataset augmentation, yet existing methods lack mechanisms to preserve statistical properties necessary for scientific analysis. To address this, we introduce RLSYN+REG, a reinforcement learning-driven generative model, which encourages regression models trained on synthetic data to reproduce the coefficients and predictions of their real-data counterparts. We evaluate RLSYN+REG on MIMIC-III and the American Community Survey (ACS) across regression model reproduction, fidelity to real data, and privacy. Synthetic data from RLSYN+REG substantially improves upon that of RLSYN, raising correlations between real and synthetic regression coefficients from 0.054 to 0.600 on MIMIC-III and from 0.160 to 0.376 on ACS. Predictive performance also improves, reducing the gap between real-data baselines by 81.4% and 97.6% on MIMIC-III and ACS, respectively. These improvements come with negligible cost to fidelity or privacy and are robust to reductions in training data.

19
Data-optimal scaling of paired antibody language models

Neyestanak, M. S.; Burbach, S. M.; Ng, K.; Gangavarapu, P.; Hurtado, J.; Magura, J.; Ismail, N.; Muema, D.; Ndung'u, T.; Ward, A.; Briney, B.

2025-09-06 immunology 10.1101/2025.09.02.673765 medRxiv
Top 0.1%
10.1%

Scaling laws for large language models in natural language domains are typically derived under the assumption that performance is primarily compute-constrained. In contrast, antibody language models (AbLMs) trained on paired sequences are primarily data-limited, thus requiring different considerations. To explore how model size and data scale affect AbLM performance, we trained 15 AbLMs across all pairwise combinations of five model sizes and three training data sizes. From these experiments, we derive an AbLM-specific scaling law and estimate that training a data-optimal AbLM equivalent of the highly performant 650M-parameter ESM-2 protein language model would require ~5.5 million paired antibody sequences. Evaluation on multiple downstream classification tasks revealed that significant performance gains emerged only with sufficiently large model size, suggesting that in data-limited domains, improved performance depends jointly on both model scale and data volume.

20
Multiscale Hyperbolic Embedding for Cell Hierarchies in Large-Scale Bioinformatics Data

Yao, M.; Praturu, A.; Sharpee, T.

2025-10-01 biophysics 10.1101/2025.09.29.679407 medRxiv
Top 0.1%
10.1%

The increasing size of datasets poses challenges for their visualization and interpretation, highlighting the need for scalable and effective analysis methods. Hyperbolic embeddings have shown strong potential in capturing complex hierarchical structures across diverse systems. However, existing hyperbolic embedding methods typically operate with fixed curvature and have difficulties scaling to large datasets. To address these limitations, we propose MuH-MDS, a novel multiscale algorithm for hyperbolic multidimensional scaling that uses an "adiabatic" approximation from physics to optimize local positions while keeping cluster centroids fixed. MuH-MDS improves computing time by 10^3 compared to previous methods and is able to handle large datasets comprising over 80,000 samples. We validate the method on a number of datasets, including a large-scale C. elegans embryogenesis scRNA-seq dataset with over 80,000 samples. Here, MuH-MDS uncovers intrinsic hierarchical structure, and achieves improved pseudotime inference and lineage analysis compared to UMAP and other methods. Unlike UMAP and t-SNE, which emphasize local structure at the expense of global coherence and metric accuracy, MuH-MDS preserves global hierarchy in a metrically faithful manner, maintaining key relationships across scales.